2024-04-01
Frederick Mosteller
Carl Morris
Amstat News (March 2016):
“One cannot think of Dick without thinking of his enthusiasm for sports. He played tennis and softball as long as he was able… He had his favorite teams and never missed an opportunity to watch one of their games.”
Sports is a great way to communicate statistical thinking.
Knowing both the sports application and the theory helps the research.
In sports, we learn how people in the world interpret and use data.
Who was the better home run hitter?
Look at some career statistics:
| Player | Home Runs | PA | HR Rate (%) |
|---|---|---|---|
| Mickey Mantle | 536 | 9910 | 5.41 |
| Hank Aaron | 755 | 13941 | 5.42 |
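The rate column in the table appears to be home runs per 100 plate appearances. A minimal check in Python, using only the table's own numbers and assuming that definition of the rate, reproduces both values:

```python
# Career totals copied from the table above.
careers = {
    "Mickey Mantle": {"hr": 536, "pa": 9910},
    "Hank Aaron": {"hr": 755, "pa": 13941},
}

def hr_rate(hr, pa):
    """Home runs per 100 plate appearances (assumed definition of the rate column)."""
    return 100 * hr / pa

rates = {name: round(hr_rate(c["hr"], c["pa"]), 2) for name, c in careers.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

The two career rates differ by about one hundredth of a percentage point, so this single summary does little to separate the two hitters.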
Distinguishing home run performance and home run ability
Peak Ability – what is a player’s home run ability at his peak?
Career Ability – what is a player’s career home run ability?
where the means satisfy a quadratic model
\[\mu_j = \beta_0 + \beta_1 \, AGE_j + \beta_2 \, AGE_j^2\]
Posterior estimates shrink observed HR rates towards the parabola
Get posterior of a player’s peak ability: \[ PEAK = \max_j \{p_j\} \]
Get posterior of age that achieves peak
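As a rough illustration of the last two steps, the sketch below turns draws of the quadratic coefficients into draws of the peak age and peak home run probability. It assumes the quadratic is on the logit scale and concave, so the peak sits at age \(-\beta_1 / (2 \beta_2)\). The function name `peak_from_coefs` and the coefficient draws are hypothetical and are not fitted to any player.

```python
import math
from statistics import mean

def peak_from_coefs(b0, b1, b2):
    """Return (peak_age, peak_prob) for a concave quadratic on the logit scale."""
    if b2 >= 0:
        raise ValueError("quadratic must be concave (b2 < 0) to have a peak")
    peak_age = -b1 / (2 * b2)                      # vertex of the parabola
    peak_logit = b0 + b1 * peak_age + b2 * peak_age ** 2
    return peak_age, 1 / (1 + math.exp(-peak_logit))

# Hypothetical posterior draws of (beta_0, beta_1, beta_2); not real estimates.
draws = [
    (-8.60, 0.400, -0.0070),
    (-8.90, 0.420, -0.0075),
    (-8.30, 0.380, -0.0068),
]
peaks = [peak_from_coefs(*d) for d in draws]
print("posterior mean peak age:", round(mean(a for a, _ in peaks), 1))
print("posterior mean peak HR prob:", round(mean(p for _, p in peaks), 3))
```

In practice the draws would come from the fitted model, and the full set of per-draw peak ages gives the posterior distribution of the age at peak rather than a single number.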
Bill James 1982 Baseball Abstract “Looking for the Prime”
James: “Any successful statistical analysis of aging must find some way to deal with the ‘white space’”
This is a missing data problem
Sampling: Binomial sampling where the probabilities follow a logistic parabolic model
Prior: Sets of regression coefficients follow a multivariate normal distribution
Model fits “borrow strength” to get improved estimates at the fitted trajectories
Improved estimates at peak ages – players tend to peak about age 28
28 is a poor estimate of peak age of all MLB players.
Should account for the missing data.
Schuckers, Lopez and MacDonald (2024) illustrate estimating player aging curves using regression and imputation.
25 32 41 45 72 76 86 87 100 131 141 150 160 162 176 178 182 187 221 228 269 301 316 339 342 343 368 406 414 420 425 433 454 455 473 522 540 554 578 588 596 598 604 616 637 640 645 652
\[p_1 = ... = p_n = p\]
the \(p_j\) are different and distributed according to a Beta curve
Maybe players are truly consistent and we are observing “chance” streakiness due to multiplicity.
Are these streaky outcomes a result of a consistent model for hitting home runs for all players?
We find more streaky players in the observed data than one would predict based on simulations from a consistent model.
So patterns of home run streakiness are “interesting”.
Raises question: Do truly streaky hitters exist?
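One way to compare a player's season against a consistent model is a permutation test: shuffle the order of the plate appearances many times and see how often the shuffled season looks at least as clumpy as the observed one. The sketch below is an assumption-laden illustration, not the analysis behind these slides. The clumpiness statistic (sum of squared gaps between home runs) is one possible choice, and the season is hypothetical.

```python
import random

def clumpiness(outcomes):
    """Sum of squared gaps between successive home runs (larger = streakier)."""
    positions = [i for i, hit in enumerate(outcomes) if hit]
    return sum((b - a) ** 2 for a, b in zip(positions, positions[1:]))

def permutation_p_value(outcomes, n_sims=2000, seed=0):
    """Share of random shuffles at least as clumpy as the observed sequence."""
    rng = random.Random(seed)
    observed = clumpiness(outcomes)
    shuffled = list(outcomes)
    count = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)  # a consistent hitter's home runs land at random
        if clumpiness(shuffled) >= observed:
            count += 1
    return count / n_sims

# Hypothetical season: 600 plate appearances, 30 home runs in two hot stretches.
season = [0] * 600
for start in (100, 400):
    for k in range(15):
        season[start + 3 * k] = 1
p_value = permutation_p_value(season)
print(p_value)
```

Running this test for every player and counting the small p-values addresses the multiplicity concern: under a consistent model, roughly 5% of players would fall below 0.05 by chance alone.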
| Season | Home Runs |
|---|---|
| 2015 | 4909 |
| 2016 | 5610 |
| 2017 | 6105 |
| 2018 | 5585 |
| 2019 | 6776 |
| 2021 | 5944 |
| 2022 | 5215 |
| 2023 | 5868 |
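To put the swings in the table on a common scale, the short sketch below computes the percent change between successive listed seasons. It uses only the totals in the table; these are raw counts, so they do not adjust for the number of games or batted balls.

```python
# Season home run totals copied from the table above (2020 is not listed).
totals = {2015: 4909, 2016: 5610, 2017: 6105, 2018: 5585,
          2019: 6776, 2021: 5944, 2022: 5215, 2023: 5868}

def pct_change(old, new):
    """Percent change from old to new."""
    return 100 * (new - old) / old

seasons = sorted(totals)
changes = {(a, b): round(pct_change(totals[a], totals[b]), 1)
           for a, b in zip(seasons, seasons[1:])}
for (a, b), change in changes.items():
    print(f"{a} -> {b}: {change:+.1f}%")

# Overall rise across the 2015 to 2017 window.
rise_2015_2017 = round(pct_change(totals[2015], totals[2017]), 1)
print(f"2015 -> 2017: {rise_2015_2017:+.1f}%")
```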
Define the home run rate as the fraction of \(HR\) among all batted balls (\(AB - SO\)): \[ HR \, Rate = \frac{HR}{AB - SO} \]
Look at history of \(HR\) rates
In the fall of 2017, Major League Baseball charged a committee with identifying the potential causes of the increase in the rate at which home runs were hit from 2015 to 2017.
Committee released two reports (May 2018 and December 2019)
The batters?
The pitchers?
The ball?
Game conditions?
IN-PLAY: Have to put the ball in play
HIT IT RIGHT: The batted ball needs to have the “right” launch angle and exit velocity
REACH THE SEATS: Given the exit velocity and launch angle, needs to have sufficient distance and height to clear the fence (the carry of ball)
Aaron Judge hit 62 home runs in 2022, a season when the ball was relatively dead
Raises the question: How many home runs would Judge hit in a different season of the Statcast era?
Suppose the different season is 2019.
Fit a “2019 ball model” that predicts the probability of a HR in 2019 given values of the launch angle and exit velocity.
Collect the launch variables for Judge for all balls put into play. For each BIP, predict P(HR) using 2019 ball model.
Sum the probabilities – predict the season HR.
Express the logit of the home run probability as \[ \log \left(\frac{P(HR)}{1 - P(HR)}\right) = s(LA, LS) \]
\(s()\) is a smooth function of the launch angle (LA) and the launch speed (LS)
Generalization of the linear regression model \(y = X \beta + \epsilon\)
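Inverting the logit recovers the home run probability itself, which is the quantity that gets summed over balls in play. This is the standard inverse-logit identity and holds for any smooth \(s()\): \[ P(HR) = \frac{\exp\{s(LA, LS)\}}{1 + \exp\{s(LA, LS)\}} \]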
For each of Judge’s balls in play in 2022, predict the probability of HR from the launch variables using the 2019 ball model.
Sum the probabilities – predict total HR count
Can get a 90% prediction interval
If Judge had been hitting a 2019 ball, the model predicts he would have hit 75 home runs
A 90% prediction interval would be (69, 81)
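The sum-and-interval step can be sketched as follows. Given one predicted probability per ball in play, the expected home run count is the sum of the probabilities, and a prediction interval can be approximated by simulating each ball in play as an independent coin flip. The function name `predict_season_hr` and the probabilities are hypothetical; they are not Judge's data and not the output of the 2019 ball model.

```python
import random

def predict_season_hr(probs, n_sims=5000, level=0.90, seed=1):
    """Expected HR count and a simulation-based prediction interval.

    probs: per-ball-in-play HR probabilities from some fitted ball model.
    """
    rng = random.Random(seed)
    expected = sum(probs)
    # Each simulated season flips one biased coin per ball in play.
    sims = sorted(sum(rng.random() < p for p in probs) for _ in range(n_sims))
    lo = sims[int(n_sims * (1 - level) / 2)]
    hi = sims[int(n_sims * (1 + level) / 2) - 1]
    return expected, (lo, hi)

# Hypothetical probabilities for 400 balls in play; NOT Judge's actual data.
probs = [0.9] * 40 + [0.5] * 40 + [0.1] * 100 + [0.01] * 220
expected, interval = predict_season_hr(probs)
print(round(expected, 1), interval)
```

This interval reflects only the randomness of the outcomes given the probabilities. It does not include uncertainty in the fitted ball model, so an interval that accounts for both would be somewhat wider.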
Use GAM model to predict prob(HR) from the launch angle and exit velocity for one season
Use this ball model to predict HR probability using 2022 launch variables
Sum prediction probabilities
Judge only hit 62 home runs in 2022
But if he had played in a different season when the ball was more alive (more carry), his predicted 2022 count would be in the 70s
So Judge’s home run achievement is understated
Due to this ball bias, we don’t appreciate the magnitude of Judge’s accomplishment
Two important factors in home run hitting are the hitters (values of launch variables) and the ball (carry or drag coefficient).
Batters are stronger and changing their hitting approach, leading to higher rates of “HR friendly” balls in play.
The composition of the ball has gone through dramatic changes during the Statcast era.
Currently the ball is relatively dead compared to previous seasons.
Growing application of Statistics – the American Statistical Association has a Section on Statistics in Sports
Journals such as the Journal of Quantitative Analysis in Sports (JQAS) and the Journal of Sports Analytics (JSA).
Conferences such as New England Symposium on Statistics in Sports, Carnegie Mellon Sports Analytics Conference, and Saberseminar.
Abundance of free sports data available.
Tools for working with data (such as R) are readily available.
Analysis jobs and internships are available in professional sports teams.
College sports teams can benefit from an “analytics coach”.
Analyzing Baseball Data with R
3rd edition (with Max Marchi and Ben Baumer) available this summer
Online version available NOW at http://tinyurl.com/abdwr3e
Chapters on sources of baseball data, sabermetrics, and R